Easy Multiprocessing on Pandas DataFrames

Python's Pandas library is great for all sorts of data-wrangling tasks, but it does not parallelize work out of the box. Here is a simple approach: take a Pandas DataFrame and a function, split the DataFrame into chunks, and apply the function to the chunks in parallel.
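As a rough preview of the idea, here is a minimal sketch using the standard library's `multiprocessing.Pool`. The helper names (`split_frame`, `parallel_apply`, `add_tip_pct`), the default of four workers, and the tiny hand-built frame are all assumptions made for illustration, not part of Pandas or of the approach developed below.

```python
from multiprocessing import Pool

import numpy as np
import pandas as pd


def split_frame(df, n_chunks):
    """Split df row-wise into at most n_chunks pieces of near-equal size."""
    positions = np.array_split(np.arange(len(df)), n_chunks)
    return [df.iloc[pos] for pos in positions if len(pos) > 0]


def parallel_apply(df, func, n_workers=4):
    """Apply func to row chunks of df and concatenate the results in order."""
    chunks = split_frame(df, n_workers)
    if not chunks:
        # Nothing to split (empty frame): just call func directly.
        return func(df)
    if n_workers == 1:
        # Skip the pool entirely; handy for debugging.
        results = [func(chunk) for chunk in chunks]
    else:
        with Pool(n_workers) as pool:
            # pool.map preserves the order of the chunks.
            results = pool.map(func, chunks)
    return pd.concat(results)


def add_tip_pct(chunk):
    """Hypothetical per-chunk function.

    It is defined at module top level so worker processes can unpickle it.
    """
    chunk = chunk.copy()
    chunk["tip_pct"] = chunk["tip"] / chunk["total_bill"]
    return chunk


if __name__ == "__main__":
    # Illustrative data only; the guard is required on platforms that
    # start workers with "spawn" (Windows, macOS).
    demo = pd.DataFrame(
        {"total_bill": [16.99, 10.34, 21.01], "tip": [1.01, 1.66, 3.50]}
    )
    print(parallel_apply(demo, add_tip_pct, n_workers=2))
```

Each chunk is pickled and sent to a worker process, so for a small DataFrame the overhead can easily outweigh the gain; the approach tends to pay off only when the per-chunk function is expensive.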

First, let's download a dataset:


In [40]:
import pandas as pd
import seaborn as sns

# load_dataset fetches the example data and already returns a DataFrame
df = sns.load_dataset('tips')
print(df.shape)
print(df.head(3))


(244, 7)
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3

Say we wanted to get tip percentage. We can